
Centralized training

For centralized training, the setup is simpler than for federated learning. Depending on whether you decide to train on a cloud or an Arc node, you need to create either a VM or an Arc cluster.

Create a compute VM

While a federated server uses attached compute, for centralized training we create a compute VM via the "Compute clusters" tab of the "Compute" menu.

Depending on the size of your dataset, the selected problem, and your cost constraints, you will need to choose one of the VMs. In some cases it may make more sense to invest in a smaller number of faster GPUs rather than a larger number of slower ones. In our experiments we utilized 4 GPUs.

Fig. 1 - Compute VM creation

Register Arc cluster

If you followed the whole HW setup part, you should have an Arc cluster ready and attached to the AML workspace.

Now you can head over to the /dev/ihd-central-training folder and select ihd-central.ipynb.

Inside .ipynb

In this .ipynb, you first run a set of imports needed to control AML.

Afterwards, you configure your own credentials for the workspace, the resource group (RG), and the environment where the code will be executed.

There are two examples, one using Azure Arc and the other a cloud VM. The structure of the command is essentially the same in both.

The mount location and target are defined above the command. target specifies the cluster or attached compute where we want to execute the experiment. mount_location tells where the data are located: on premises with Arc, or in the cloud with Arc or an Azure VM.

Inside, you see parameters such as:

code - the folder where the code is located, relative to the script

command - the command to execute; in this example a single script called train.py

environment - the environment in which the script should run

compute - the machine where the environment and code should be deployed and executed

display_name, description - the name under which the job appears and its description

shm_size - how much shared memory the job's container can allocate

resources, distribution - how many cluster nodes should be used and how many GPUs a single node should utilize

Centralized evaluation

Because evaluation suffered from the poor resiliency of distributed PyTorch training, it was separated into its own script.

There is also another script for evaluating an already trained model from the Model repository in your current workspace. It takes one additional parameter, model_name_and_version, which can be assembled easily as "azureml:" + model name + ":" + version. The model name and version can be found in the Models tab after opening the model.
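Assembling that identifier is a simple string concatenation. For instance, assuming a hypothetical model named ihd-model at version 3:

```python
model_name = "ihd-model"  # hypothetical name from the Models tab
model_version = "3"       # version shown next to the model

# Assemble the identifier expected by the evaluation script.
model_name_and_version = f"azureml:{model_name}:{model_version}"
print(model_name_and_version)  # → azureml:ihd-model:3
```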

Fig. 2 - ML model stored in Azure